Questions on Linear regression
  1. This question involves the use of simple linear regression on the Auto dataset. This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University. The dataset has the following fields:
auto<-read.csv("Auto.csv")
str(auto)
## 'data.frame':    397 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : chr  "130" "165" "150" "150" ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
auto$horsepower <- as.numeric(as.character(auto$horsepower))
## Warning: NAs introduced by coercion
model1<- lm(mpg~horsepower, data=auto)
summary(model1)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
  (a) Comment on the output by answering the following questions:
    • Is there a strong relationship between the predictor and the response?
      Answer. Yes, there is a strong relationship between the predictor and the response. The p-value is essentially zero, so we can reject the null hypothesis that the corresponding \(\beta_1= 0\).
    • Is the relationship between the predictor and the response positive or negative?
      Answer. Since \(\hat{\beta}_1=-0.1578\), there is a negative linear relationship between mpg and horsepower.
  (b) What is the predicted mpg associated with a horsepower of 98? What is the associated 99% confidence interval?
    Hint: You can check the predict.lm function on how the confidence interval can be computed for predictions with R.
    Answer. The predicted mpg for horsepower of 98 is 24.46708 and the 99% confidence interval is [23.816,25.117].
predict.lm(model1,newdata=data.frame(horsepower=98),interval=c("confidence"),level=.99)
##        fit      lwr      upr
## 1 24.46708 23.81669 25.11747
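The interval above is a confidence interval for the mean response at horsepower = 98. A sketch of the contrast with a 99% prediction interval, which also accounts for the irreducible error of a single new observation and is therefore wider (assuming the Auto data from the ISLR package, which matches the cleaned CSV used above):

```r
# Sketch, not part of the original solution: prediction vs. confidence interval.
# Assumes ISLR::Auto, whose mpg/horsepower columns match the CSV after cleaning.
library(ISLR)
fit <- lm(mpg ~ horsepower, data = Auto)
predict(fit, newdata = data.frame(horsepower = 98),
        interval = "prediction", level = 0.99)
```

The prediction interval is noticeably wider than the confidence interval reported above, since it covers an individual car rather than the average mpg at that horsepower.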
  (c) Compute the correlation between the response and the predictor variable. How does this compare with the \(R^2\) value?
    Answer. We use cor(auto$mpg, auto$horsepower, use = "pairwise.complete.obs"), which excludes an observation whenever one of its entries is NA. The correlation is -0.7784; its square, 0.6059, equals the \(R^2\) of the regression, as expected for a model with a single predictor.
cor(auto$mpg,auto$horsepower, use = "pairwise.complete.obs")
## [1] -0.7784268
cor(auto$mpg,auto$horsepower, use = "pairwise.complete.obs")^2
## [1] 0.6059483
  (d) Plot the response and the predictor. Also plot the least squares regression line.
library(ggplot2)
ggplot(auto,aes(horsepower,mpg))+geom_point(na.rm=T)+geom_smooth(method="lm",na.rm=T,se=F)
## `geom_smooth()` using formula 'y ~ x'

# Alternative plotting
#plot(auto$horsepower,auto$mpg)
#abline(model1)
  (e) First install the package ggfortify, which aids plotting linear models with ggplot2. Use the following two commands in R to produce diagnostic plots of the linear regression fit:
    > library(ggfortify)
    > autoplot(your_model_name)
    Comment on the Residuals versus Fitted plot and the Normal Q-Q plot and on any problems you might see with the fit.
    Answer. A good linear fit should have residuals scattered randomly around zero. In this model the residuals first decrease and then increase as the fitted values increase, a U-shaped pattern. The normal Q-Q plot also shows that the distribution of the residuals departs from normality in the extreme tails. This indicates that the data might have some nonlinearities and outliers.
library(ggfortify)
autoplot(model1,label.size = 3)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

# Alternative plotting without ggplot
#layout(matrix(1:4,2,2))
#plot(model1)
  2. This question involves the use of multiple linear regression on the Auto dataset, building on Question 1.
  (a) Produce a scatterplot matrix which includes all the variables in the dataset.
    Answer. An easy way to do this is using pairs(your_data_name). Instead we use the pairs.panels function from the psych package.
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
pairs.panels(auto,ellipses = F, lm =T, breaks=10, hist.col="blue")

  (b) Compute a matrix of correlations between the variables using the function cor(). You need to exclude the name variable, which is qualitative.

Answer. We use subset(auto,select=-c(name)) to remove name from the data. The presence of NA values in horsepower makes all the correlations involving horsepower NA; passing use = "pairwise.complete.obs" to cor() would avoid this.

str(auto)
## 'data.frame':    397 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
auto1<-subset(auto,select=-c(name))
str(auto1)
## 'data.frame':    397 obs. of  8 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
cor(auto1)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7762599   -0.8044430         NA -0.8317389
## cylinders    -0.7762599  1.0000000    0.9509199         NA  0.8970169
## displacement -0.8044430  0.9509199    1.0000000         NA  0.9331044
## horsepower           NA         NA           NA          1         NA
## weight       -0.8317389  0.8970169    0.9331044         NA  1.0000000
## acceleration  0.4222974 -0.5040606   -0.5441618         NA -0.4195023
## year          0.5814695 -0.3467172   -0.3698041         NA -0.3079004
## origin        0.5636979 -0.5649716   -0.6106643         NA -0.5812652
##              acceleration       year     origin
## mpg             0.4222974  0.5814695  0.5636979
## cylinders      -0.5040606 -0.3467172 -0.5649716
## displacement   -0.5441618 -0.3698041 -0.6106643
## horsepower             NA         NA         NA
## weight         -0.4195023 -0.3079004 -0.5812652
## acceleration    1.0000000  0.2829009  0.2100836
## year            0.2829009  1.0000000  0.1843141
## origin          0.2100836  0.1843141  1.0000000
  (c) Perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Comment on the output by answering the following questions:
    • Is there a strong relationship between the predictors and the response?
      Answer. The p-value for the F-statistic is very close to 0 for the multiple regression model. We can reject the null hypothesis that all the \(\beta_i\)’s are zero; hence there is a strong relationship between the predictors and the response.
    • Which predictors appear to have a statistically significant relationship to the response?
      Answer. The variables displacement, weight, year, and origin are statistically significant at the 1% level.
    • What does the coefficient for the year variable suggest?
      Answer. The coefficient for year is positive and its p-value is close to 0. This shows that year is positively related to mpg: each additional model year adds about 0.751 miles per gallon, everything else staying the same.
model2 <-lm(mpg~., data=auto1)
summary(model2)
## 
## Call:
## lm(formula = mpg ~ ., data = auto1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
  3. This problem focuses on the multicollinearity problem with simulated data.
  (a) Perform the following commands in R:
set.seed(1)
x1 <- runif(100)
x2 <- 0.5*x1 + rnorm(100)/10
y <- 2 + 2*x1 + 0.3*x2 + rnorm(100)

The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?
Answer. The form of the linear model is: \[ y = \beta_0+ \beta_1 x_1 + \beta_2 x_2 +\epsilon\] where the coefficients are \(\beta_0=2, \beta_1=2, \beta_2=0.3\).

  (b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.
    Answer. The correlation is 0.8351, a strong positive correlation between x1 and x2 (as the scatterplot also shows).
cor(x1,x2)
## [1] 0.8351212
ggplot(data.frame(x1,x2),aes(x1,x2))+geom_point()

  (c) Using the data, fit a least squares regression to predict y using x1 and x2.
    • What are the estimated parameters of \(\hat{\beta_0}, \hat{\beta_1}\) and \(\hat{\beta_2}\)? How do these relate to the true \({\beta_0}, {\beta_1}\) and \({\beta_2}\)?
      Answer. From the fit we see that \(\hat{\beta}_0=2.1305, \hat{\beta}_1=1.4396, \hat{\beta}_2=1.0097\). Since the true values are \(\beta_0=2, \beta_1=2, \beta_2=0.3\), there is a clear discrepancy between \(\beta_1, \hat{\beta}_1\) and \(\beta_2, \hat{\beta}_2\), but less so between \(\beta_0, \hat{\beta}_0\).
    • Can you reject the null hypothesis \(H_0:\beta_1=0\)?
    • How about the null hypothesis \(H_0:\beta_2=0\)?
      Answer. From our regression model, we reject the null hypothesis that \(\beta_1=0\) at the 5% level, but we cannot reject the null hypothesis that \(\beta_2=0\) at the 5% level. Note that the answers here may change depending on the data simulated, but most of the time we can reject one but not the other.
model3 <- lm(y~x1+x2)
summary(model3)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05
  (d) Now fit a least squares regression to predict y using only x1.
    • How does the estimated \({\hat{\beta_1}}\) relate to the true \({\beta_1}\)?
    • Can you reject the null hypothesis \(H_0:\beta_1=0\)?
      Answer. The estimated \(\hat{\beta}_1=1.9759\), which is close to \(\beta_1=2\), and we can reject the null hypothesis \(H_0:\beta_1=0\) as the p-value is close to zero.
model4 <- lm(y~x1)
summary(model4)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06
  (e) Now fit a least squares regression to predict y using only x2.
    • How does the estimated \({\hat{\beta_2}}\) relate to the true \({\beta_2}\)?
    • Can you reject the null hypothesis \(H_0:\beta_2=0\)?
      Answer. The estimated \(\hat{\beta}_2=2.8996\), which is quite far from \(\beta_2=0.3\). We reject the null hypothesis \(H_0:\beta_2=0\) here as the p-value is close to zero.
model5 <- lm(y~x2)
summary(model5)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05
  (f) Provide an explanation on the results in parts (c)-(e).
    Answer. There is multicollinearity between x1 and x2. In the multiple regression it is difficult to reject \(H_0:\beta_j=0\) for one of the coefficients, because the correlated predictors inflate the standard errors of the estimates, while in the single-variable regressions we can reject \(H_0:\beta_j=0\) for each predictor on its own.
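The inflation of the standard errors can be quantified with a hand-computed variance inflation factor (VIF); a sketch, not part of the original solution, regenerating the data from part (a):

```r
# Sketch: VIF computed by hand; not part of the original solution.
# Recreate the simulated data exactly as in part (a).
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10

# The VIF for x1 is 1 / (1 - R^2), where R^2 comes from regressing x1 on
# the remaining predictors (here just x2); by symmetry x2 has the same VIF.
r2 <- summary(lm(x1 ~ x2))$r.squared
1 / (1 - r2)  # about 3.3: moderate, but enough to inflate standard errors
```

With only two predictors this R-squared is just the squared correlation 0.8351^2, so the VIF is roughly 3.3; rules of thumb flag values above 5 or 10, but even this level widens the individual t-tests enough to mask \(\beta_2\).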
  4. This problem involves the Boston dataset. This data was part of an important paper in 1978 by Harrison and Rubinfeld titled “Hedonic housing prices and the demand for clean air” published in the Journal of Environmental Economics and Management 5(1): 81-102. The dataset has the following fields:
  (a) For each predictor, fit a simple linear regression model using a single variable to predict the response. In which of these models is there a statistically significant relationship between the predictor and the response? Plot the relationship between medv and lstat as an example to validate your finding.
    Answer. We show the fits for model1 (medv ~ crim) and model13 (medv ~ lstat) below. The same check should be done for all thirteen models to verify that the p-values are close to zero when testing \(H_0: \beta_j=0\) for \(j=1,\ldots,13\).
boston <- read.csv("Boston.csv")
colnames(boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
model1 <- lm(medv~crim, data=boston)
model2 <- lm(medv~zn, data=boston)
model3 <- lm(medv~indus, data=boston)
model4 <- lm(medv~chas, data=boston)
model5 <- lm(medv~nox, data=boston)
model6 <- lm(medv~rm, data=boston)
model7 <- lm(medv~age, data=boston)
model8 <- lm(medv~dis, data=boston)
model9 <- lm(medv~rad, data=boston)
model10 <- lm(medv~tax, data=boston)
model11 <- lm(medv~ptratio, data=boston)
model12 <- lm(medv~black, data=boston)
model13 <- lm(medv~lstat, data=boston)

summary(model1)
## 
## Call:
## lm(formula = medv ~ crim, data = boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.957  -5.449  -2.007   2.512  29.800 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 24.03311    0.40914   58.74   <2e-16 ***
## crim        -0.41519    0.04389   -9.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.484 on 504 degrees of freedom
## Multiple R-squared:  0.1508, Adjusted R-squared:  0.1491 
## F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16
# Verify this for all the models by checking that the p-values are close to 0.

summary(model13)
## 
## Call:
## lm(formula = medv ~ lstat, data = boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
ggplot(boston,aes(lstat,medv))+geom_point(na.rm=T)+geom_smooth(method="lm",na.rm=T,se=F)
## `geom_smooth()` using formula 'y ~ x'
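The thirteen separate fits above can also be produced in one pass; a sketch, not part of the original solution, assuming the Boston data from the MASS package (which contains the same variables as Boston.csv):

```r
# Sketch: loop over every predictor of medv; assumes MASS::Boston, which
# mirrors the Boston.csv file used in the solution above.
library(MASS)
preds <- setdiff(names(Boston), "medv")
fits  <- lapply(preds, function(p) lm(reformulate(p, response = "medv"), data = Boston))
names(fits) <- preds
# p-value of each slope coefficient, one per simple regression:
sapply(fits, function(f) coef(summary(f))[2, 4])
```

Every slope p-value comes out far below 0.05, confirming the claim that each predictor on its own has a statistically significant relationship with medv.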

  (b) Fit a multiple linear regression model to predict your response using all the predictors. Compare the adjusted \(R^2\) from this model with the simple regression models. For which predictors can we reject the null hypothesis \(H_0:\beta_j=0\)?
    Answer. The adjusted \(R^2\) is 0.7338 which is larger than the adjusted \(R^2\) from any of the simple regression models. The variables for which we can reject the \(H_0: \beta_j=0\) are crim, zn, chas, nox, rm, dis, rad, tax, ptratio, black, lstat at the 0.05 significance level.
modelall<- lm(medv~., data=boston)
summary(modelall)
## 
## Call:
## lm(formula = medv ~ ., data = boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.595  -2.730  -0.518   1.777  26.199 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288    
## chas         2.687e+00  8.616e-01   3.118 0.001925 ** 
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229    
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338 
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
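The list of significant predictors can be read off programmatically from the coefficient table rather than by eye; a sketch, not part of the original solution, again assuming MASS::Boston in place of the CSV:

```r
# Sketch: extract predictors significant at the 5% level; assumes
# MASS::Boston, which mirrors the Boston.csv file above.
library(MASS)
modelall <- lm(medv ~ ., data = Boston)
ctab <- coef(summary(modelall))
# drop the intercept row, keep predictors whose p-value is below 0.05
rownames(ctab)[-1][ctab[-1, "Pr(>|t|)"] < 0.05]
```

This returns every predictor except indus and age, matching the starred rows in the summary output above.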
  (c) Create a plot displaying the univariate regression coefficients from (a) on the X-axis and the multiple regression coefficients from (b) on the Y-axis. That is, each predictor is displayed as a single point in the plot. Comment on this plot.
    Answer. The figure indicates a positive, roughly linear association between the coefficients from the simple regressions and those from the multiple regression.
Ind <- c(model1$coef[2], model2$coef[2], model3$coef[2], model4$coef[2], model5$coef[2],
model6$coef[2], model7$coef[2], model8$coef[2], model9$coef[2], model10$coef[2],
model11$coef[2], model12$coef[2], model13$coef[2])
All <- modelall$coef[2:14]
ggplot(data.frame(Ind,All),aes(Ind,All)) + geom_point()+geom_smooth(method="lm",se=F)+ggtitle("Coefficient relationship") + xlab("Simple linear regression") + ylab("Multiple linear regression")
## `geom_smooth()` using formula 'y ~ x'

  (d) In this question, we will check if there is evidence of a non-linear association between the lstat predictor variable and the response. To answer the question, fit a model of the form
    medv = \(\beta_0\) + \(\beta_1\)lstat + \(\beta_2\)lstat\(^2\) + \(\epsilon\).
    You can make use of the poly() function in R. Does this help improve the fit? Add higher degree polynomials to fit the data. What is the degree of the polynomial fit beyond which the terms no longer remain significant?
    Answer. Yes, adding higher-degree terms helps improve the fit. Beyond degree 5, adding additional terms does not seem to improve the model (additional parameters do not remain significant).
    We plot the data points (lstat,medv) along with the linear (blue curve) and polynomial of degree 5 (red curve) fits below.
summary(model13)
## 
## Call:
## lm(formula = medv ~ lstat, data = boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
modelpoly2 <- lm(medv~poly(lstat,2,raw=TRUE), data = boston)
summary(modelpoly2)
## 
## Call:
## lm(formula = medv ~ poly(lstat, 2, raw = TRUE), data = boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2834  -3.8313  -0.5295   2.3095  25.4148 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 42.862007   0.872084   49.15   <2e-16 ***
## poly(lstat, 2, raw = TRUE)1 -2.332821   0.123803  -18.84   <2e-16 ***
## poly(lstat, 2, raw = TRUE)2  0.043547   0.003745   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16
modelpoly3 <- lm(medv~poly(lstat,3,raw=TRUE), data = boston)
summary(modelpoly3)
## 
## Call:
## lm(formula = medv ~ poly(lstat, 3, raw = TRUE), data = boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.5441  -3.7122  -0.5145   2.4846  26.4153 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 48.6496253  1.4347240  33.909  < 2e-16 ***
## poly(lstat, 3, raw = TRUE)1 -3.8655928  0.3287861 -11.757  < 2e-16 ***
## poly(lstat, 3, raw = TRUE)2  0.1487385  0.0212987   6.983 9.18e-12 ***
## poly(lstat, 3, raw = TRUE)3 -0.0020039  0.0003997  -5.013 7.43e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.396 on 502 degrees of freedom
## Multiple R-squared:  0.6578, Adjusted R-squared:  0.6558 
## F-statistic: 321.7 on 3 and 502 DF,  p-value: < 2.2e-16
modelpoly4 <- lm(medv~poly(lstat,4,raw=TRUE), data = boston)
summary(modelpoly4)
## 
## Call:
## lm(formula = medv ~ poly(lstat, 4, raw = TRUE), data = boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.563  -3.180  -0.632   2.283  27.181 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  5.731e+01  2.280e+00  25.134  < 2e-16 ***
## poly(lstat, 4, raw = TRUE)1 -7.028e+00  7.308e-01  -9.618  < 2e-16 ***
## poly(lstat, 4, raw = TRUE)2  4.955e-01  7.489e-02   6.616 9.50e-11 ***
## poly(lstat, 4, raw = TRUE)3 -1.631e-02  2.994e-03  -5.448 7.98e-08 ***
## poly(lstat, 4, raw = TRUE)4  1.949e-04  4.043e-05   4.820 1.90e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.28 on 501 degrees of freedom
## Multiple R-squared:  0.673,  Adjusted R-squared:  0.6704 
## F-statistic: 257.8 on 4 and 501 DF,  p-value: < 2.2e-16
modelpoly5 <- lm(medv~poly(lstat,5,raw=TRUE), data = boston)
summary(modelpoly5)
## 
## Call:
## lm(formula = medv ~ poly(lstat, 5, raw = TRUE), data = boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5433  -3.1039  -0.7052   2.0844  27.1153 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  6.770e+01  3.604e+00  18.783  < 2e-16 ***
## poly(lstat, 5, raw = TRUE)1 -1.199e+01  1.526e+00  -7.859 2.39e-14 ***
## poly(lstat, 5, raw = TRUE)2  1.273e+00  2.232e-01   5.703 2.01e-08 ***
## poly(lstat, 5, raw = TRUE)3 -6.827e-02  1.438e-02  -4.747 2.70e-06 ***
## poly(lstat, 5, raw = TRUE)4  1.726e-03  4.167e-04   4.143 4.03e-05 ***
## poly(lstat, 5, raw = TRUE)5 -1.632e-05  4.420e-06  -3.692 0.000247 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared:  0.6817, Adjusted R-squared:  0.6785 
## F-statistic: 214.2 on 5 and 500 DF,  p-value: < 2.2e-16
modelpoly6 <- lm(medv~poly(lstat,6,raw=TRUE), data = boston)
summary(modelpoly6)
## 
## Call:
## lm(formula = medv ~ poly(lstat, 6, raw = TRUE), data = boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7317  -3.1571  -0.6941   2.0756  26.8994 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  7.304e+01  5.593e+00  13.059  < 2e-16 ***
## poly(lstat, 6, raw = TRUE)1 -1.517e+01  2.965e+00  -5.115 4.49e-07 ***
## poly(lstat, 6, raw = TRUE)2  1.930e+00  5.713e-01   3.378 0.000788 ***
## poly(lstat, 6, raw = TRUE)3 -1.307e-01  5.202e-02  -2.513 0.012295 *  
## poly(lstat, 6, raw = TRUE)4  4.686e-03  2.407e-03   1.947 0.052066 .  
## poly(lstat, 6, raw = TRUE)5 -8.416e-05  5.450e-05  -1.544 0.123186    
## poly(lstat, 6, raw = TRUE)6  5.974e-07  4.783e-07   1.249 0.212313    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.212 on 499 degrees of freedom
## Multiple R-squared:  0.6827, Adjusted R-squared:  0.6789 
## F-statistic: 178.9 on 6 and 499 DF,  p-value: < 2.2e-16
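The "stop at degree 5" conclusion can be made explicit with a nested-model F-test; a sketch, not part of the original solution, using MASS::Boston (the anova comparison gives the same F statistic whether raw or orthogonal polynomials are used, since both parameterizations fit the same model):

```r
# Sketch: nested F-test of degree 5 vs. degree 6; assumes MASS::Boston,
# which mirrors the Boston.csv file above.
library(MASS)
fit5 <- lm(medv ~ poly(lstat, 5), data = Boston)
fit6 <- lm(medv ~ poly(lstat, 6), data = Boston)
anova(fit5, fit6)  # large p-value: the degree-6 term adds nothing significant
```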
boston$pr1 <- predict(model13,newdata=boston)
boston$pr5 <- predict(modelpoly5,newdata=boston)

ggplot(boston)+geom_point(aes(lstat,medv))+geom_line(aes(lstat,pr1),color="blue",size=2)+geom_line(aes(lstat,pr5),color="red",linetype="solid",size=2)

  5. Orley Ashenfelter, in his paper “Predicting the Quality and Price of Bordeaux Wines” published in The Economic Journal, showed that the variability in the prices of Bordeaux wines is predicted well by the weather that created the grapes. In this question, you will validate how these results translate to a dataset for wines produced in Australia. The data is provided in the file winedata.csv. The dataset contains the following variables:
  (a) Define two new variables age91 and age92 that capture the age of the wine (in years) at the time of the auctions. For example, a 1961 wine would have an age of 30 at the auction in 1991. What is the average price of wines that were 15 years or older at the time of the 1991 auction?
    Answer. The average price of wines that were 15 years or older at the 1991 auction is $96.44.
wine<-read.csv("winedata.csv")
str(wine)
## 'data.frame':    25 obs. of  7 variables:
##  $ vintage : int  1961 1962 1963 1964 1965 1966 1967 1968 1969 1970 ...
##  $ price91 : num  131.5 156 118 92.8 119.6 ...
##  $ price92 : num  132 181 180 115 157 ...
##  $ temp    : num  18.8 18.8 18 NA 18.6 ...
##  $ hrain   : num  20.3 41 42.1 41.1 0.5 ...
##  $ wrain   : num  328 411 453 372 374 ...
##  $ tempdiff: num  7.36 7.21 6.84 7.76 8.38 7.26 6.88 7.2 6.58 7.47 ...
wine$age91<-1991-wine$vintage
wine$age92<-1992-wine$vintage
mean(subset(wine$price91,wine$age91>=15))
## [1] 96.43563
  (b) What is the average price of the wines in the 1991 auction that were produced in years when both the harvest rain was below average and the temperature difference was below average?
    Answer. The average price in 1991 when harvest rain and temperature difference were below average is $72.87.
mean(subset(wine$price91,wine$hrain<mean(wine$hrain)&wine$tempdiff<mean(wine$tempdiff)))
## [1] 72.86714
  (c) In this question, you will develop a simple linear regression model to fit the log of the price at which the wine was auctioned in 1991 with the age of the wine. To fit the model, use a training set with data for the wines up to (and including) the year 1981. What is the R-squared for this model?
    Answer. \(R^2\) for this model is 0.6675.
train<-subset(wine,vintage<=1981)
model1<-lm(log(price91)~age91,data=train)
summary(model1)
## 
## Call:
## lm(formula = log(price91) ~ age91, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26897 -0.13328  0.01939  0.10452  0.41913 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.571442   0.144163  24.774 6.30e-16 ***
## age91       0.042610   0.006899   6.176 6.19e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1914 on 19 degrees of freedom
## Multiple R-squared:  0.6675, Adjusted R-squared:   0.65 
## F-statistic: 38.15 on 1 and 19 DF,  p-value: 6.188e-06
  (d) Find the 99% confidence interval for the estimated coefficients from the regression.
    Answer.
    For the intercept (\(\beta_0\)): [3.159, 3.984].
    For age91 (\(\beta_1\)): [0.023, 0.062].
confint(model1, level = 0.99)
##                  0.5 %     99.5 %
## (Intercept) 3.15900093 3.98388288
## age91       0.02287254 0.06234705
  (e) Use the model to predict the log of prices for wines made from 1982 onwards and auctioned in 1991. What is the test R-squared?
    Answer. Test \(R^2=0.9213742.\)
test<-subset(wine,vintage>=1982)
predtest<-predict(model1,newdata=test)
sse<-sum((log(test$price91)-predtest)^2)
sst<-sum((log(test$price91)-mean(log(train$price91)))^2)
testR2<- 1-sse/sst
testR2
## [1] 0.9213742
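The out-of-sample R-squared computed above can be wrapped as a small helper (a hypothetical convenience function, not part of the original solution); note that the baseline mean must come from the training set, exactly as in the calculation above:

```r
# Sketch: out-of-sample R^2 as a reusable helper (hypothetical function).
# train_mean is the mean of the response on the TRAINING data, so the
# statistic measures improvement over the naive train-mean prediction.
test_r2 <- function(actual, predicted, train_mean) {
  sse <- sum((actual - predicted)^2)   # residual sum of squares on the test set
  sst <- sum((actual - train_mean)^2)  # total sum of squares vs. train baseline
  1 - sse / sst
}
# usage, with the objects defined above:
# test_r2(log(test$price91), predtest, mean(log(train$price91)))
```

Unlike the in-sample R-squared, this quantity can be negative when the model predicts worse than the training mean.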
  (f) Which among the following options best describes the quality of fit of the model for this dataset in comparison with the Bordeaux wine dataset that was analyzed by Orley Ashenfelter?
    • The result indicates that the variation of the prices of the wines in this dataset is explained much less by the age of the wine in comparison to Bordeaux wines.
    • The result indicates that the variation of the prices of the wines in this dataset is explained much more by the age of the wine in comparison to Bordeaux wines.
    • The age of the wine has no predictive power on the wine prices in both the datasets.
      Answer. In comparison to the results for the Bordeaux wine data, both the training (model) \(R^2\) and the test \(R^2\) are higher for this new dataset. This indicates that the variation in the prices of the wines in this dataset is explained much more by the age of the wines in comparison to the Bordeaux dataset.
  (g) Construct a multiple regression model to fit the log of the price at which the wine was auctioned in 1991 with all the possible predictors (age91, temp, hrain, wrain, tempdiff) in the training dataset. To fit your model, use the data for wines made up to (and including) the year 1981. What is the R-squared for the model?
    Answer. For this model \(R^2=0.7938.\)
model2<-lm(log(price91)~temp+hrain+wrain+tempdiff+age91,data=train)
summary(model2)
## 
## Call:
## lm(formula = log(price91) ~ temp + hrain + wrain + tempdiff + 
##     age91, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32298 -0.07019 -0.01634  0.11051  0.24455 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.2626138  1.4914578   1.517   0.1532    
## temp         0.1135015  0.0734344   1.546   0.1462    
## hrain       -0.0028825  0.0016205  -1.779   0.0987 .  
## wrain        0.0002520  0.0003918   0.643   0.5313    
## tempdiff    -0.1213280  0.1445947  -0.839   0.4166    
## age91        0.0482331  0.0075346   6.402 2.34e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1803 on 13 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7938, Adjusted R-squared:  0.7145 
## F-statistic: 10.01 on 5 and 13 DF,  p-value: 0.000421
  1. Is this model preferred to the model with only the age variable as a predictor (use the adjusted R-squared for the model to decide on this)?
    Answer. With only the age variable, adjusted \(R^2=0.65\). On the other hand, with all the variables, adjusted \(R^2=0.7145\). This seems to indicate that the latter model (with more variables included) is preferred.
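The adjusted \(R^2\) values used in this comparison can be reproduced directly from \(R^2\), the sample size \(n\), and the number of predictors \(p\). A minimal sketch; note the age-only model's \(R^2 \approx 0.6675\) is backed out of the F-statistic reported earlier (38.15 on 1 and 19 DF, so \(n = 21\) complete cases), hence approximate:

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2(0.6675, 21, 1)  # age-only model (19 residual df): ~0.65
adj_r2(0.7938, 19, 5)  # five-predictor model (13 residual df): ~0.7145
```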

  2. Which among the following best describes the output from the fitted model?

    • The result indicates that less the temperature, the better is the price and quality of the wine
    • The result indicates that greater the temperature difference, the better is the price and quality of wine.
    • The result indicates that lesser the harvest rain, the better is the price and quality of the wine.
    • The result indicates that winter rain is a very important variable in the fit of the data.
      Answer. The result indicates that the lesser the harvest rain, the better the price and the quality of the wine will be. This is because the corresponding coefficient \(\beta=-0.003\) is negative and significant at the 10% level. All other statements appear to be false.
  3. Of the five variables (age91, temp, hrain, wrain, tempdiff), drop the two variables that are the least significant from the results in (g). Rerun the linear regression and write down your fitted model.
    Answer. The least significant variables are wrain and tempdiff, with p-values 0.53 and 0.42 respectively, so we drop these two and create model3.

model3<-lm(log(price91)~temp+hrain+age91,data=train)
summary(model3)
## 
## Call:
## lm(formula = log(price91) ~ temp + hrain + age91, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.32198 -0.09905  0.00491  0.14536  0.29828 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.801013   1.297697   1.388    0.185    
## temp         0.097500   0.068750   1.418    0.177    
## hrain       -0.001983   0.001236  -1.604    0.130    
## age91        0.045670   0.006702   6.814 5.85e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1752 on 15 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7753, Adjusted R-squared:  0.7304 
## F-statistic: 17.26 on 3 and 15 DF,  p-value: 3.973e-05
  1. Is this model preferred to the model with all variables as predictors (use the adjusted R-squared in the training set to decide on this)?
    Answer. In the training set, adjusted \(R^2\) for this model is 0.73 while for model2, adjusted \(R^2\) is 0.7145. In this case, the new model3 is preferred to model2.

  2. Using the variables identified in (j), construct a multiple regression model to fit the log of the price at which the wine was auctioned in 1992 (remember to use age92 instead of age91). To fit your model, use the data for wines made up to (and including) the year 1981. What is the R-squared for the model?
    Answer. \(R^2\) for this model is 0.5834.

model4<-lm(log(price92)~temp+hrain+age92,data=train)
summary(model4)
## 
## Call:
## lm(formula = log(price92) ~ temp + hrain + age92, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29394 -0.15545  0.03238  0.10221  0.37661 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.723393   1.574961   1.729 0.104296    
## temp         0.072494   0.083333   0.870 0.398043    
## hrain       -0.001539   0.001498  -1.027 0.320713    
## age92        0.035237   0.008124   4.338 0.000586 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2123 on 15 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.5834, Adjusted R-squared:  0.5001 
## F-statistic: 7.002 on 3 and 15 DF,  p-value: 0.003621
  1. Suppose in this application, we assume that a variable is statistically significant at the 0.2 level. Would you reject the hypothesis that the coefficient for the variable hrain is zero?
    Answer. The p-value for hrain is 0.32, which is greater than 0.2. Hence we cannot reject the null hypothesis that the coefficient for hrain is zero.
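Rather than reading the p-value off the printed table, it can be extracted programmatically from the summary object. A small self-contained sketch on hypothetical data (since the wine data file may not be at hand):

```r
set.seed(1)
toy <- data.frame(y = rnorm(20), x = rnorm(20))  # hypothetical data
fit <- lm(y ~ x, data = toy)
# Column "Pr(>|t|)" of the coefficient table holds the p-values
pvals <- summary(fit)$coefficients[, "Pr(>|t|)"]
# Reject H0: beta_x = 0 at the 0.2 level only if the p-value falls below 0.2
unname(pvals["x"] < 0.2)
```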

  2. By separately estimating the equations for the wine prices for each auction, we can better establish the credibility of the explanatory variables because:

    • We have more data to fit our models with.
    • The effect of the weather variables and age of the wine (sign of the estimated coefficients) can be checked for consistency across years.
    • 1991 and 1992 are the markets when the Australian wines were traded heavily.
    Select the best option.
      Answer. The best explanation seems to be that we can check for consistency of the effect of weather variables and age by looking at the sign of the estimated coefficients.
  3. The current fit of the linear regression using the weather variables drops all observations where any of the entries are missing. Provide a short explanation on when this might not be a reasonable approach to use.
    Answer. Dropping observations with missing entries is reasonable when the entries are few and missing at random. However, if there are many missing entries, we can lose a lot of data; and if the missingness is systematic (e.g. weather records absent for particular years), dropping those rows can bias the estimated coefficients.
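To make the trade-off concrete, here is a small sketch with hypothetical data showing that lm() applies na.omit by default, along with one crude alternative (mean imputation) that keeps all rows:

```r
d <- data.frame(y = c(1, 2, 3, 4, 5),
                x = c(1, NA, 3, 4, NA))   # toy data with missing entries
fit <- lm(y ~ x, data = d)
nobs(fit)   # 3: the two rows with a missing predictor are silently dropped
# A crude alternative: mean-impute the predictor to retain all rows
# (only sensible if the values are missing at random)
d$x_imp <- ifelse(is.na(d$x), mean(d$x, na.rm = TRUE), d$x)
fit2 <- lm(y ~ x_imp, data = d)
nobs(fit2)  # 5: all rows retained
```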

Questions on PCA
  1. This question involves the use of principal component analysis on the well-known iris dataset. The dataset is available in R.
  1. How many observations are there in the dataset? What are the different fields/attributes in the data set?
    Answer. There are 150 observations. There are four different attributes (of type numeric): Sepal length, Sepal width, Petal length and Petal width. The fifth field gives the Species which are of three types: setosa, versicolor, virginica.
dim(iris)
## [1] 150   5
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
  1. Create a new dataset iris.data by removing the Species column, and store the Species column separately as iris.sp.
 iris.data<-iris[,-5]
 iris.sp<-iris[,5]
  1. Compare the various pairs of features using a pairwise scatterplot and find the correlation coefficients between the features. Which features seem to be highly correlated?
    Answer. We use the psych package and its pairs.panels function.
    Petal length and Petal width seem highly positively correlated. Sepal length and Petal length are also highly positively correlated. Sepal length and Petal width also seem quite correlated.
library(psych)
pairs.panels(iris.data, ellipses = F, lm =T, breaks=10, hist.col="blue")
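The correlations can also be checked numerically with base R (no extra packages), which confirms the pairs seen in the scatterplot:

```r
# Correlation matrix of the four numeric attributes of the built-in iris data
iris.data <- iris[, -5]
round(cor(iris.data), 2)
# Petal.Length-Petal.Width: 0.96; Sepal.Length-Petal.Length: 0.87
```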

  1. Conduct a principal component analysis on iris.data without standardizing the data. You may use prcomp(..., scale=F).
    1. How many principal components are required to explain at least 90% of the variability in the data? Plot the cumulative percentage of variance explained by the principal components to answer this question.
      Answer. Clearly in this case, the first principal component already explains more than 90% of the variability in the data set.
pr.out<-prcomp(iris.data,scale=F)
summary(pr.out) 
## Importance of components:
##                           PC1     PC2    PC3     PC4
## Standard deviation     2.0563 0.49262 0.2797 0.15439
## Proportion of Variance 0.9246 0.05307 0.0171 0.00521
## Cumulative Proportion  0.9246 0.97769 0.9948 1.00000
pr.out$sdev
## [1] 2.0562689 0.4926162 0.2796596 0.1543862
pve<-pr.out$sdev^2/sum(pr.out$sdev^2)
cpve<-cumsum(pve)
pve
## [1] 0.924618723 0.053066483 0.017102610 0.005212184
cpve
## [1] 0.9246187 0.9776852 0.9947878 1.0000000
plot(cpve,xlab="Principal components",type="l",ylim=c(0.7,1))

In general it is suggested that data be scaled prior to conducting PCA.
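A quick way to see why: rescale one attribute (a hypothetical change of units) and the unscaled PCA becomes dominated by it, while the scaled PCA is unchanged:

```r
d <- iris[, -5]
d$Petal.Width <- d$Petal.Width * 100   # pretend the units changed
pve1 <- function(pr) pr$sdev[1]^2 / sum(pr$sdev^2)  # PC1's share of variance
pve1(prcomp(d, scale = FALSE))  # close to 1: PC1 is essentially Petal.Width
pve1(prcomp(d, scale = TRUE))   # ~0.73, same as for the original scaled data
```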
ii) Plot the data along the first two principal components and color the different species separately. Does the first principal component create enough separation among the different species? To plot, you may use the function fviz_pca_ind or fviz_pca_biplot in library(factoextra). Alternatively, you may use biplot or construct a plot using ggplot2 as well.
Answer. The first principal component already creates a separation between setosa and the other two categories, although versicolor remains quite close to virginica. Moreover, it appears the same proportions of Petal length and Petal width are used in both PC1 and PC2 (the only difference is the magnitude). We can also draw 95% confidence ellipses around the groups to see how well they are separated.

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_pca_biplot(pr.out, label = "var", habillage=iris.sp)

#fviz_pca_biplot(pr.out, label = "var", habillage=iris.sp,addEllipses=TRUE, ellipse.level=0.95)
  1. Do the same exercise as in (d) above, now after standardizing the dataset. Comment on any differences you observe.
    Answer. From the standardized PCA, we observe that two principal components are required to explain more than 90% of the variability in the data (with two components we explain about 96%).

When the data are standardized, one guideline for picking the number of relevant principal components is to keep those with eigenvalues (the variances of the components) greater than 1, which means each retained component explains at least as much variance as one original variable. Check that the sum of the eigenvalues (variances) equals the number of variables (four here).
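A sketch of this guideline applied to the standardized iris PCA (recomputed here so the snippet is self-contained):

```r
pr <- prcomp(iris[, -5], scale = TRUE)
eig <- pr$sdev^2   # eigenvalues = variances of the components
sum(eig)           # 4, the number of variables
which(eig > 1)     # only PC1 strictly exceeds 1 (PC2 is ~0.91, just below)
```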

From the plots we can see that after standardizing, the second principal component also contributes substantially to explaining the variability. Since Petal length was recorded on a larger numeric scale, the non-standardized version gave much more importance to this attribute. In the standardized version, Petal length and Petal width tend to get the same importance (they are also quite correlated). Moreover, Sepal width gets more prominent weights in both principal component 1 and principal component 2. The first two components have standard deviations 1.71 and 0.96 (variances 2.92 and 0.91); although the second eigenvalue falls just below 1, together the two components explain about 96% of the variance, so it is natural to choose two principal components here to explain the data variability.

pr.out.sc<-prcomp(iris.data,scale=T)
summary(pr.out.sc) 
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
pr.out.sc$sdev
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
sum((pr.out.sc$sdev)^2)
## [1] 4
plot(pr.out.sc$sdev^2)
abline(h=1,col="red")
pve.sc<-pr.out.sc$sdev^2/sum(pr.out.sc$sdev^2)
cpve.sc<-cumsum(pve.sc)
pve.sc
## [1] 0.729624454 0.228507618 0.036689219 0.005178709
cpve.sc
## [1] 0.7296245 0.9581321 0.9948213 1.0000000
plot(cpve.sc,xlab="Principal components",type="l",ylim=c(0.4,1))

# compare with the unscaled case:
#plot(cpve,xlab="Principal components",type="l",ylim=c(0.4,1))
library(factoextra)
fviz_pca_biplot(pr.out.sc, label = "var", habillage=iris.sp)

fviz_pca_biplot(pr.out.sc, label = "var", habillage=iris.sp,addEllipses=TRUE, ellipse.level=0.95)

7. This problem involves the dataset wine_italy.csv which was obtained from the University of California, Irvine Machine Learning Repository. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines. The first column identifies the cultivar and the next thirteen are the attributes given by:
    • alcohol: Alcohol
    • malic: Malic acid
    • ash: Ash
    • alkalin: Alkalinity of ash
    • mag: Magnesium
    • phenols: Total phenols
    • flavanoids: Flavanoids
    • nonflavanoids: Nonflavanoid phenols
    • proanth: Proanthocyanins
    • color: Color intensity
    • hue: Hue
    • od280: OD280/OD315 of diluted wines
    • proline: Proline

  1. Check the relationship between the variables by creating a pair-wise scatterplot of the thirteen attributes.
    Answer. From the scatterplot we see that flavanoids and total phenols are quite positively correlated. Flavanoids are also quite positively correlated with proanthocyanins.
wineitaly<-read.csv("wine_italy.csv",sep=",",head=T)
str(wineitaly)
## 'data.frame':    178 obs. of  14 variables:
##  $ class        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ alcohol      : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ malic        : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ ash          : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ alkalin      : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ magnesium    : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ phenols      : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ flavanoids   : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ nonflavanoids: num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ proanth      : num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ color        : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ hue          : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ od280        : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ proline      : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
wine.it<-wineitaly[,-1]
wine.cl<-wineitaly[,1]

pairs.panels(wine.it, ellipses = F, lm =T, breaks=10, hist.col="blue")

b. Conduct a principal component analysis on the standardized data. What proportion of the total variability in the data is explained by the first two components?

Answer. The first two components explain 55% of the total variation.

pr.wine<-prcomp(wine.it,scale=T)
summary(pr.wine) 
## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.169 1.5802 1.2025 0.95863 0.92370 0.80103 0.74231
## Proportion of Variance 0.362 0.1921 0.1112 0.07069 0.06563 0.04936 0.04239
## Cumulative Proportion  0.362 0.5541 0.6653 0.73599 0.80162 0.85098 0.89337
##                            PC8     PC9   PC10    PC11    PC12    PC13
## Standard deviation     0.59034 0.53748 0.5009 0.47517 0.41082 0.32152
## Proportion of Variance 0.02681 0.02222 0.0193 0.01737 0.01298 0.00795
## Cumulative Proportion  0.92018 0.94240 0.9617 0.97907 0.99205 1.00000
pr.wine$sdev
##  [1] 2.1692972 1.5801816 1.2025273 0.9586313 0.9237035 0.8010350 0.7423128
##  [8] 0.5903367 0.5374755 0.5009017 0.4751722 0.4108165 0.3215244
  1. Plot the data along the first two principal components and color the different cultivars separately. Also plot the loadings of the different components to show the importance of the different attributes on the first two principal components.
    Answer.
#library(factoextra)
fviz_pca_biplot(pr.wine, label = "var", habillage=wine.cl)

fviz_pca_biplot(pr.wine, label = "var", habillage=wine.cl,addEllipses=TRUE, ellipse.level=0.95)

i) Which two key attributes differentiate Cultivar 2 from the other two cultivars?
Answer. From the principal component loadings for each attribute, we can see that Cultivar 2 (green triangles) has lighter color intensity and lower alcohol content (it is plotted opposite to the directions of these attributes in the PC1-PC2 plane). It also has a higher OD280/OD315 ratio and stronger hue than the other cultivars (but these weights are a little smaller than the first two).

pr.wine$rotation[order(pr.wine$rotation[,1],decreasing=T),1:2] ## order according to PC1
##                        PC1          PC2
## nonflavanoids  0.298533103  0.028779488
## malic          0.245187580  0.224930935
## alkalin        0.239320405 -0.010590502
## color          0.088616705  0.529995672
## ash            0.002051061  0.316068814
## magnesium     -0.141992042  0.299634003
## alcohol       -0.144329395  0.483651548
## proline       -0.286752227  0.364902832
## hue           -0.296714564 -0.279235148
## proanth       -0.313429488  0.039301722
## od280         -0.376167411 -0.164496193
## phenols       -0.394660845  0.065039512
## flavanoids    -0.422934297 -0.003359812
pr.wine$rotation[order(pr.wine$rotation[,2],decreasing=T),1:2] ## order according to PC2
##                        PC1          PC2
## color          0.088616705  0.529995672
## alcohol       -0.144329395  0.483651548
## proline       -0.286752227  0.364902832
## ash            0.002051061  0.316068814
## magnesium     -0.141992042  0.299634003
## malic          0.245187580  0.224930935
## phenols       -0.394660845  0.065039512
## proanth       -0.313429488  0.039301722
## nonflavanoids  0.298533103  0.028779488
## flavanoids    -0.422934297 -0.003359812
## alkalin        0.239320405 -0.010590502
## od280         -0.376167411 -0.164496193
## hue           -0.296714564 -0.279235148
ii) Which two key attributes differentiate Cultivar 3 from the other two cultivars?  

Answer. Similarly we can see that Cultivar 3 (blue squares) has higher content of malic acid and non-flavanoids, as well as higher alkalinity especially with respect to cultivar 1.

  1. Use an appropriate plot to find the number of attributes required to explain at least 80% of the total variation in the data. How many principal components would you pick to explain the variability in the data reasonably?
    Answer. We first plot the cumulative percentage of variance explained. From the plot (the red horizontal line is at the level 0.8 and the blue vertical line at index 5) and the calculations, we can see that the first five principal components are necessary to explain at least 80% of the variability. We also make a scree plot and observe that only the first three variances are greater than 1 (the fourth, 0.92, is just below), hence choosing 3-4 principal components reasonably explains the variability in the data.
pve.w<- pr.wine$sdev^2/sum(pr.wine$sdev^2)
cpve.w<- cumsum(pve.w)
pve.w
##  [1] 0.361988481 0.192074903 0.111236305 0.070690302 0.065632937 0.049358233
##  [7] 0.042386793 0.026807489 0.022221534 0.019300191 0.017368357 0.012982326
## [13] 0.007952149
cpve.w
##  [1] 0.3619885 0.5540634 0.6652997 0.7359900 0.8016229 0.8509812 0.8933680
##  [8] 0.9201754 0.9423970 0.9616972 0.9790655 0.9920479 1.0000000
plot(cpve.w,xlab="Principal components",type="l",ylim=c(0,1))
abline(h=0.8,col="red")
abline(v=5,col="blue")

plot(pr.wine,type="l",ylim=c(0,5),main="Scree plot")
abline(h=1,col="red")

End of Exercise